Towards Shared Datasets for Normalization Research
Abstract
In this paper we present a Dutch and an English dataset that can serve as gold standards for evaluating text normalization approaches. Combining text messages, message board posts and tweets, these datasets represent a variety of user-generated content. All data were manually normalized to their standard form using newly developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We work at both the word and character level and find that we can improve the BLEU score by ca. 20% for both languages. Before this user-generated content can be released publicly to the research community, some issues first need to be resolved. We discuss these in closer detail by examining the current legislation and by investigating previous, similar data collection projects. With this discussion we hope to shed some light on the various difficulties researchers face when trying to share social media data.
Related resources
Normalization of qPCR array data: a novel method based on procrustes superimposition
MicroRNAs (miRNAs) are short, endogenous non-coding RNAs that function as guide molecules to regulate transcription of their target messenger RNAs. Several methods including low-density qPCR arrays are being increasingly used to profile the expression of these molecules in a variety of different biological conditions. Reliable analysis of expression profiles demands removal of technical variati...
Context Tailoring for Text Normalization
Language processing tools suffer from significant performance drops in social media domain due to its continuously evolving language. Transforming non-standard words into their standard forms has been studied as a step towards proper processing of ill-formed texts. This work describes a normalization system that considers contextual and lexical similarities between standard and non-standard wor...
Towards Data Submissions for Shared Tasks: First Experiences for the Task of Text Alignment
This paper reports on the organization of a new kind of shared task that outsources the creation of evaluation resources to its participants. We introduce the concept of data submissions for shared tasks, and we use our previous shared task on text alignment as a testbed. A total of eight evaluation datasets have been submitted by as many participating teams. To validate the submitted datasets,...
Does Normalization Methods Play a Role for Hyperspectral Image Classification?
For Hyperspectral image (HSI) datasets, each class have their salient feature and classifiers classify HSI datasets according to the class's saliency features, however, there will be different salient features when use different normalization method. In this letter, we report the effect on classifiers by different normalization methods and recommend the best normalization methods for classifier...
Functional connectivity change as shared signal dynamics.
BACKGROUND: An increasing number of neuroscientific studies gain insights by focusing on differences in functional connectivity between groups, individuals, temporal windows, or task conditions. We found using simulations that additional insights into such differences can be gained by forgoing variance normalization, a procedure used by most functional connectivity measures. Simulations indicate...
Publication date: 2014